The Experiment of Thai Document Indexing and Clustering for VLSHDS Project
Abstract
This paper presents the VLSHDS project (Very Large Scale Hypermedia Delivery System), a networked research collaboration between NII and NAiST. The goal of the project is the study, design, and implementation of a very large scale hypermedia delivery system integrated with a Natural Language Processing based approach to Thai text retrieval. Distance learning tutorials have been set up between NII and NAiST as part of Mlabnet to disseminate knowledge and use of AHYDS (Advanced Hypermedia Delivery System) and the Phasme engine used inside the VLSHDS project. At present, a first prototype of VLSHDS integrating text retrieval capabilities for Thai, Japanese, and English has been designed and implemented. In addition to Thai document processing, NAiST also plans to provide a Thai Language Processing Service using PHASME application-oriented service functions and the AHYDS component, and to extend the research collaboration by cooperatively developing the “Global Medical Plant Garden”.
Similar resources
A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts
This paper proposes a non-segmented document clustering method using a self-organizing map (SOM) and the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and is successful in many applications. However, when it is applied to non-segmented documents, the challenge is to identify any interesting pattern efficiently. There ar...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
An Enhancement of Thai Text Retrieval Efficiency by Automatic Backward Transliteration
Loan words, which are borrowed from foreign languages, are used in many languages such as Japanese, Chinese, Korean and Thai. They affect Thai Text Retrieval (TTR) systems, leading to inaccurate term weights for indexing and text clustering. Therefore, there is a need for automatic backward transliteration that can solve this problem. In this paper, we propose a hybrid model approa...
Hybrid Document Indexing with Spectral Embedding
Document representation has a large impact on the performance of document retrieval and clustering algorithms. We propose a hybrid document indexing scheme that combines the traditional bag-of-words representation with spectral embedding. This method accounts for the specifics of the document collection and also uses semantic similarity information based on a large scale statistical analysis. Cl...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...